New York City is full of urban wildlife, and rats are one of the city’s most infamous animals. Rats in NYC are plentiful and in September 2015 a viral video of a brown rat carrying a slice of pizza down the steps of a NYC subway station in Manhattan created the legend of Pizza Rat.

The source for the data of rat sightings in the city is from NYC’s 311 Service Requests from 2010 to Present. This is a dataset that is pretty much constantly updated and I downloaded this dataset on 11-Oct-2022. For this group project, you will use R, dplyr, and ggplot2 to explore and explain the data, and tell an interesting story. The data fields are shown below and a full data dictionary can be found here

Rows: 204,232
Columns: 45
$ unique_key                     <dbl> 40177738, 54095783, 52787670, 52046514,…
$ created_date                   <dttm> 2018-09-03 12:21:09, 2022-05-05 15:02:…
$ closed_date                    <chr> "09/14/2018 05:46:05 PM", "05/05/2022 0…
$ agency                         <chr> "DOHMH", "DOHMH", "DOHMH", "DOHMH", "DO…
$ agency_name                    <chr> "Department of Health and Mental Hygien…
$ complaint_type                 <chr> "Rodent", "Rodent", "Rodent", "Rodent",…
$ descriptor                     <chr> "Rat Sighting", "Rat Sighting", "Rat Si…
$ location_type                  <chr> "3+ Family Apt. Building", "Other (Expl…
$ incident_zip                   <dbl> 11229, 11216, 11201, 11222, 10009, 1141…
$ incident_address               <chr> "1213 AVENUE U", "1228 DEAN STREET", "2…
$ street_name                    <chr> "AVENUE U", "DEAN STREET", "WYCKOFF STR…
$ cross_street_1                 <chr> "EAST 12 STREET", "WILLIAM BILL HOWARD …
$ cross_street_2                 <chr> "HOMECREST AVENUE", "NEW YORK AVENUE", …
$ intersection_street_1          <chr> "EAST 12 STREET", "WILLIAM BILL HOWARD …
$ intersection_street_2          <chr> "HOMECREST AVENUE", "NEW YORK AVENUE", …
$ address_type                   <chr> "ADDRESS", "ADDRESS", "ADDRESS", "INTER…
$ city                           <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", NA,…
$ landmark                       <chr> NA, "DEAN STREET", "WYCKOFF STREET", NA…
$ facility_type                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ status                         <chr> "Closed", "Closed", "Closed", "In Progr…
$ due_date                       <chr> "10/03/2018 12:12:40 PM", NA, NA, NA, "…
$ resolution_action_updated_date <chr> "09/14/2018 05:46:05 PM", "05/05/2022 0…
$ community_board                <chr> "15 BROOKLYN", "08 BROOKLYN", "02 BROOK…
$ borough                        <chr> "Brooklyn", "Brooklyn", "Brooklyn", "Br…
$ x_coordinate_state_plane       <dbl> 995446, 998524, 986303, 999634, 990213,…
$ y_coordinate_state_plane       <dbl> 157321, 185855, 189477, 203643, 204768,…
$ park_facility_name             <chr> "Unspecified", "Unspecified", "Unspecif…
$ park_borough                   <chr> "BROOKLYN", "BROOKLYN", "BROOKLYN", "BR…
$ vehicle_type                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_company_borough           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxi_pick_up_location          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_name            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_direction       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ road_ramp                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bridge_highway_segment         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latitude                       <dbl> 40.59848, 40.67679, 40.68675, 40.72562,…
$ longitude                      <dbl> -73.95968, -73.94854, -73.99260, -73.94…
$ location                       <chr> "(40.598478991333735, -73.9596835550102…
$ sighting_date                  <date> 2018-09-03, 2022-05-05, 2021-12-13, 20…
$ sighting_year                  <dbl> 2018, 2022, 2021, 2021, 2018, 2018, 201…
$ sighting_month                 <dbl> 9, 5, 12, 10, 7, 7, 7, 7, 12, 12, 6, 4,…
$ sighting_month_name            <ord> September, May, December, October, July…
$ sighting_day                   <int> 3, 5, 13, 1, 12, 12, 15, 24, 26, 28, 6,…
$ sighting_weekday               <ord> Monday, Thursday, Monday, Friday, Thurs…
$ sighting_hour                  <int> 12, 15, 14, 18, 22, 9, 15, 13, 18, 12, …

1 Starter Code

I have provided some starter code below. A couple comments about it:

  • By default, vroom() treats cells that are empty or “NA” as missing values. This rat dataset uses “N/A” to mark missing values, so we need to add that as a possible marker of missingness (hence na = c("", "NA", "N/A"))
  • To make life easier, I always use janitor::clean_names() to remove spaces, upper case letters, etc. from variable names.
  • I’ve also created a few date-related variables (sighting_year, sighting_month, sighting_day, and sighting_weekday). You don’t have to use them, but they’re there if you need them. The functions that create these, like year() and wday() are part of the lubridate library.
  • The date/time variables are formatted like 09/14/2018 05:46:05 PM, which R is not able to automatically parse as a date when reading the CSV file. You can use the mdy_hms() function in the lubridate library to parse dates that are structured as “month-day-year-hour-minute”. There are also a bunch of other iterations of this function, like ymd(), dmy(), etc., for other date formats.
  • There’s a few rows (about 10) with an unspecified borough, so I filter them out.
  • I use str_to_title() to turn the upper case borough into a more legible format, from MANHATTAN to Manhattan and from STATEN ISLAND to Staten Island
library(tidyverse)
library(lubridate)

# If you get an error "All formats failed to parse. No formats found",
# it's because the mdy_hms function couldn't parse the date. The date
# variable *should* be in this format: "09/14/2018 05:46:05 PM", but in some
# rare instances, it might load without the seconds as "09/14/2018 05:00 PM".
# If there are no seconds, use mdy_hm() instead of mdy_hms().

rats <- vroom::vroom(here::here("data/Rat_Sightings.csv.zip"), na = c("", "NA", "N/A")) %>% 
  janitor::clean_names() %>% 
  mutate(created_date = mdy_hms(created_date)) %>%
  mutate(sighting_date = as.Date(created_date), # just the date without hour info
         sighting_year = year(created_date),
         sighting_month = month(created_date),
         sighting_month_name = month(created_date, label = TRUE, abbr = FALSE),
         sighting_day = day(created_date),
         sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE)
         ) %>%
  filter(borough != "Unspecified") %>%  
  mutate(borough = str_to_title(borough))

You’ll summarise the data with functions from dplyr, including stuff like count(), arrange(), filter(), group_by(), summarise(), and mutate(). Here are some examples of ways to summarise the data:

# See the count of rat sightings by weekday
rats %>%
  count(sighting_weekday)

# Assign a summarixed data frame to an object to use it in a plot
rats_by_weekday <- rats %>%
  count(sighting_weekday, sighting_year)


ggplot(rats_by_weekday, aes(x = fct_rev(sighting_weekday), y = n)) + 
  geom_col() +
  coord_flip() +
  facet_wrap(~ sighting_year)


# See the count of rat sightings by weekday and borough
rats %>%
  count(sighting_weekday, borough, sighting_year)

# An alternative to count() is to specify the groups with group_by() and then
# be explicit about how you're summarising the groups, such as calculating the
# mean, standard deviation, or number of observations (we do that here with
# `n()`).
rats %>%
  group_by(sighting_weekday, borough) %>%
  summarise(n = n())

2 Exploratory Data Analysis (EDA)

2.1 Background Info

In the R4DS Exploratory Data Analysis chapter, the authors state:

“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation…EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”

Conduct a thorough EDA. Recall that an EDA involves three things:

  • Looking at the raw values.
    • dplyr::glimpse()
  • Computing summary statistics of the variables of interest.
    • skimr::skim()
    • corrr::correlate()
    • mosaic::favstats()
  • Creating informative visualizations.
    • ggplot2::ggplot()
      • geom_histogram() or geom_density() for numeric continuous variables
      • geom_bar() or geom_col() for categorical variables
    • GGally::ggpairs() for scaterrlot/correlation matrix
      • Note that you can add transparency to points/density plots in the aes call, for example: aes(colour = borough, alpha = 0.4)

You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should answer these questions:

  • How many variables/columns? How many rows/observations?
  • Which variables are numbers?
  • Which are categorical or factor variables (numeric or character variables with variables that have a fixed and known set of possible values?
  • What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!

In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.

2.1.1 Plotting figures

For each table, make sure to include a relevant figure. One tip for starting is to draw out on paper what you want your x- and y-axis to be first and what your geom is; that is, start by drawing the plot you want ggplot to give you.

Your figure does not have to depict every last number from the data aggregation result. Use your judgement. It just needs to complement the table, add context, and allow for some sanity checking both ways.

Notice which figures are easy/hard to make, which data formats make better inputs for plotting functions vs. for human-friendly tables.

2.1.2 Mapping

Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city, and overlay all rat sighting coordinates to get an overview of the spatial distribution of reported rat activity. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual location type and date/time of sighting. If you wanted to, you can learn more about leafelt the package that draws the interactive map, by following the relevant Datacamp course on mapping with leaflet

# let's get the top 7 location types, which account for > 90% of all cases
# this code generates a vector with the top 7 location types
top_location_types <- rats %>%
  count(location_type, sort=TRUE) %>%
  mutate(perc = 100*n/sum(n)) %>%
  slice(1:7) %>%
  select(location_type) %>%
  pull()

# lets us choose how to colour each point. What palette and colours to use? 
# A great site to get the relevant color hex codes for maps is 
# https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=7
my_colours <- c('#e41a1c','#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628')

# create a function point_fill, that assigns `my_colours` to different location types
# you can read more here https://rstudio.github.io/leaflet/colors.html
point_fill <- colorFactor(palette = my_colours,  
                          rats$location_type)



rats %>%
  filter(sighting_year == 2021) %>% # just show 2021 data
  filter(location_type %in% top_location_types) %>%
  leaflet() %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   color = ~point_fill(location_type), 
                   fillOpacity = 0.6, 
                   popup = ~created_date,
                   label = ~location_type) %>%
  
  addLegend("bottomright", pal = point_fill, 
            values = ~location_type,
            title = "2021 Location Type",
            opacity = 0.5)

2.2 Your EDA task

Summarise the data somehow. The raw data has more than 200,000 rows, which means you’ll need to aggregate the data (filter(), group_by(), and summarise() are your friends). Consider looking at the number of sightings per borough, per year, per dwelling type, etc., or a combination of these, like the change in the number sightings across the 5 boroughs between, say, 2014 and 2021.

You can recreate one of these ugly, less-than-helpful graphs, or create a new story by looking at other variables in the data.

3 Inferential Statistics

Recall that the whole point of inferential statistics is we want to make a stateent/inferene about the population, given the sample statistics we have observed. In this case, our sample is the number of rat sightings by borough and year– surely there’s more rats, we just dont seem them all! Some people argue that rat sightings are related to human activity– where there’s food and junk that humans create, rats will surely appear.

3.1 Your task: Rat sightings vs a borough’s human population

How closely do rat sightings track the human population in a borough? Are the % of rat sightings in each borough related to that borough’s population? We got data from the 2020 census and can create a small tibble nyc_population with the relevant data.

# https://en.wikipedia.org/wiki/Boroughs_of_New_York_City
# NYC Boroughs, 2020 census data
# Bronx  1,472,654  
# Brooklyn 2,736,074    
# Manhattan 1,694,263   
# Queens 2,405,464  
# Staten Island  495,747

nyc_population <- tibble(
  borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  population = c(1472654,2736074,1694263,2405464,495747)
) 

Your task is to calculate Cofidence Intervals (CIs) for the % of rat sightings in each borough and in each year between 2014 and 2022. This will give you CIs for the percentage of NYC rats that exist in that borugh/year You then need to compare those CIs with the human population % as shown in the following graph.


To change the colour of the text in the title so it matches the orange dot corresponding to the population % , you need to use the ggtext package and the following code

library(ggtext)

# your ggplot code +

  labs(
    title = "Mouse sightings don't always track a <span style='color:#FFA500'><b>borough's population</span></b>"
  ) +

  theme(
    plot.title.position = "plot",
    plot.title = element_textbox_simple(size=16)) +
  NULL

4 Regression

Besides the variables included in the rats dataframe, we have also downloaded weather data for NYC which can be found at the nyc_weather dataframe. It may be the case that rats are more active when it’s warmer? or when it rains?

Build a regression model that helps you explain the number of sightings per day. You can use both the original data set, as well as the nyc_weather, the variables of which are shown below.

Rows: 3,734
Columns: 32
$ name             <chr> "New York", "New York", "New York", "New York", "New …
$ datetime         <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-01-04, 2010…
$ tempmax          <dbl> 4.3, 1.1, -5.5, -1.0, -0.8, 1.3, 3.2, 0.9, -0.8, -1.8…
$ tempmin          <dbl> 0.9, -8.0, -8.0, -6.8, -6.1, -3.0, -1.1, -4.8, -6.8, …
$ temp             <dbl> 2.4, -3.5, -6.9, -4.7, -3.7, -1.6, 0.7, -1.3, -4.6, -…
$ feelslikemax     <dbl> 4.1, -3.8, -13.2, -6.6, -6.1, -3.3, -0.7, 0.7, -5.1, …
$ feelslikemin     <dbl> -1.8, -17.3, -17.3, -14.8, -12.8, -9.0, -6.6, -11.4, …
$ feelslike        <dbl> 1.0, -10.5, -15.2, -11.3, -9.7, -7.1, -3.8, -5.5, -10…
$ dew              <dbl> -0.4, -10.2, -14.4, -12.4, -11.3, -9.5, -6.2, -6.5, -…
$ humidity         <dbl> 82.5, 60.2, 55.1, 55.3, 55.6, 55.0, 60.4, 68.5, 47.0,…
$ precip           <dbl> 2.40, 0.18, 0.00, 0.00, 0.00, 0.00, 0.00, 0.68, 0.00,…
$ precipprob       <dbl> 100, 100, 0, 0, 0, 0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0,…
$ precipcover      <dbl> 16.67, 4.17, 0.00, 0.00, 0.00, 0.00, 0.00, 25.00, 0.0…
$ preciptype       <chr> "rain,snow", "rain,snow", NA, NA, NA, NA, NA, "rain,s…
$ snow             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ snowdepth        <dbl> 3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ windgust         <dbl> 37.1, 58.0, 60.6, 42.0, 40.7, 44.7, 48.2, 49.0, 40.7,…
$ windspeed        <dbl> 15.5, 35.1, 36.2, 27.1, 26.5, 27.1, 26.4, 32.2, 23.6,…
$ winddir          <dbl> 309.2, 295.4, 290.4, 294.4, 295.1, 299.2, 292.5, 281.…
$ sealevelpressure <dbl> 1011.7, 1004.9, 1000.6, 1005.7, 1006.4, 1006.3, 1011.…
$ cloudcover       <dbl> 99.6, 79.3, 73.3, 28.1, 50.1, 55.5, 50.2, 78.4, 1.5, …
$ visibility       <dbl> 10.6, 14.3, 15.3, 16.0, 16.0, 16.0, 16.0, 12.7, 16.0,…
$ solarradiation   <dbl> 80.1, 49.9, 36.3, 69.2, 112.9, 100.1, 114.5, 68.0, 12…
$ solarenergy      <dbl> 6.8, 4.1, 3.1, 5.9, 9.7, 8.7, 9.8, 5.9, 10.4, 10.6, 9…
$ uvindex          <dbl> 4, 2, 1, 3, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 1, 5,…
$ severerisk       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sunrise          <dttm> 2010-01-01 07:20:12, 2010-01-02 07:20:17, 2010-01-03…
$ sunset           <dttm> 2010-01-01 16:39:17, 2010-01-02 16:40:08, 2010-01-03…
$ moonphase        <dbl> 0.51, 0.53, 0.55, 0.60, 0.65, 0.70, 0.76, 0.81, 0.86,…
$ conditions       <chr> "Snow, Rain, Overcast", "Snow, Rain, Partially cloudy…
$ description      <chr> "Cloudy skies throughout the day with rain or snow cl…
$ icon             <chr> "rain", "rain", "partly-cloudy-day", "partly-cloudy-d…

In essence, we want to see how many rat sightings we have per day in each borough. Please use the following code to join the two dataframes before you run your regression model.

rats_weather <- rats %>%
  mutate(date = as.Date(created_date)) %>% 
  count(borough,date) %>%
  left_join(nyc_weather, by = c("date" = "datetime")) %>%
  mutate(month = month(date),
         month_name = month(date, label=TRUE),
         day = wday(date),
         day_of_week = wday(date, label=TRUE))

4.1 Your regression tasks

  • Use histograms or density plots to examine the distributions of n, the number of rat sightings, and log(n). Which variable should you use for the regression model? Why?

  • Fit a regression model called model1 with the following explanatory variables: temp.

    • Is the effect of temp significant? Why?
    • What proportion of the overall variability in rat sightings does temp explain?
  • Fit a regression model called model2 with the following explanatory variables: temp and borough

    • Is the effect of temp significant? Why?
    • What proportion of the overall variability in rat sightings does this model explain?
    • Why is Bronx missing from the boroughs in our model?
    • How do we interpret the coefficients for each of the boroughs?
    • What’s the difference between fitting the model of n versus log(n)
    • If you use a log transformation, the intrepretation of the betas/coefficients is slightly different. You can read more about that here FAQ HOW DO I INTERPRET A REGRESSION MODEL WHEN SOME VARIABLES ARE LOG TRANSFORMED?
  • Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

  • Are other weather variables useful in explaining n?
  • We also have data on days of the week, month of the year, etc. Could those be helpful?
  • What’s the best model you can come up with?
  • Is this a regression model to predict or explain? If we use it to predict, what’s the Residual SE

4.2 Diagnostics, collinearity, summary tables

As you keep building your models, it makes sense to:

  1. Check the residuals, using performance::check_model(model_x). You will always have some deviation from normality, especially for very high values of n
  2. As you start building models with more explanatory variables, make sure you use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have colinear variables. A general guideline is that a VIF larger than 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
  3. Create a summary table, using huxtable (https://mfa2023.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.

5 Assessment

There is a detailed assessment rubric below.

It’s not enough for your code to run– it must be well documented and commented, so another person can read your work, reproduce your code, and easily understand what you are trying to achieve.